Assessing Consumer Fraud Risk in Insurance Claims: an Unsupervised Learning Technique Using Discrete and Continuous Predictor Variables

Authors

  • Jing Ai
  • Patrick L. Brockett
  • Linda L. Golden
Abstract

We present an unsupervised learning method for classifying consumer insurance claims according to their suspiciousness of fraud versus nonfraud. The predictor variables contained within a claim file that are used in this analysis can be binary, ordinal categorical, or continuous variates. They are constructed such that the ordinal position of the response to the predictor variable bears a monotonic relationship with the fraud suspicion of the claim. Thus, although no individual variable is of itself assumed to be determinative of fraud, each of the individual variables gives a “hint” or indication as to the suspiciousness of fraud for the overall claim file. The presented method statistically concatenates the totality of these “hints” to make an overall assessment of the ranking of fraud risk for the claim files without using any a priori fraud-classified or -labeled subset of data. We first present a scoring method for the predictor variables that puts all the variables (whether binary “red flag indicators,” ordinal categorical variables with different categories of possible response values, or continuous variables) onto a common −1 to 1 scale for comparison and further use. This allows us to aggregate variables with disparate numbers of potential values. We next show how to concatenate the individual variables and obtain a measure of variable worth for fraud detection, and then how to obtain an overall holistic claim file suspicion value capable of being used to rank the claim files for determining which claims to pay and the order in which to investigate claims further for fraud.
The proposed method provides three useful outputs not usually available with other unsupervised methods: (1) an ordinal measure of overall claim file fraud suspicion level, (2) a measure of the importance of each individual predictor variable in determining the overall suspicion levels of claims, and (3) a classification function capable of being applied to existing claims as well as new incoming claims. The overall claim file score is also available to be correlated with exogenous variables such as claimant demographics or high-volume physician or lawyer involvement. We illustrate that the incorporation of continuous variables in their continuous form helps classification and that the method has internal and external validity via empirical analysis of real data sets. A detailed application to automobile bodily injury fraud detection is presented.

* Jing Ai, PhD, is Assistant Professor in the Department of Financial Economics and Institutions, Shidler College of Business, University of Hawaii at Manoa, Honolulu, HI 96822, [email protected].
† Corresponding author.
§ Patrick L. Brockett, PhD, is Gus S. Wortham Chaired Professor in Risk Management and Insurance in the McCombs School of Business, Department of Information, Risk, and Operations Management, Global Fellow IC2 Institute, University of Texas at Austin, Austin, TX 78712, [email protected].
** Linda L. Golden, PhD, is Marlene and Morton Meyerson Centennial Professor in Business in the McCombs School of Business, Department of Marketing, Global Fellow IC2 Institute, University of Texas at Austin, Austin, TX 78712, [email protected].

1. IMPORTANCE OF THE PROBLEM AND OVERVIEW OF SOLUTION

1.1 Problem Importance

The National Insurance Crime Bureau estimates that about 10 percent of property and casualty insurance claims are fraudulent, which costs Americans $30 billion per year and raises average annual household insurance premiums by $200 to $300. When indirect costs of fraud are incorporated, this cost may rise to $1,000 per year per family (Texas Department of Insurance 2006). More recently the Federal Bureau of Investigation (FBI) estimated these costs at more than $40 billion per year, resulting in an increase in family insurance premiums of between $400 and $700 per year (Federal Bureau of Investigation 2008). The Insurance Research Council puts the cost of automobile insurance fraud alone at $15–20 billion per year and estimates that approximately one-third of all bodily injury claims in automobile accidents have some degree of fraud in their claimed amount. If other lines of insurance are included, the total cost of insurance fraud may exceed $120 billion annually (IRC 2008). Thus, insurance fraud is a very serious problem, and the detection of fraud is quite important economically. Unfortunately, the problem of detecting insurance fraud is very difficult. It does not lend itself to traditional supervised statistical classification methods such as logistic regression, because by the nature of insurance fraud, it is usually costly, and often impossible, to obtain a “training sample” of insurance claims with known unambiguous fraud labels (i.e., a variable indicating whether the claim is fraudulent or legitimate). Such a sample is required for estimating parameters in standard supervised classification models. Even if one could obtain these labels, they can be subject to classification errors that contaminate a supervised classification model.
Finally, those who commit fraud specifically and thoughtfully attempt to cover their tracks so as to avoid being identified, and if a supervised model for detection is developed, the fraudsters can change their behavior to reduce the effectiveness of the detection method used. Therefore, supervised learning methods (e.g., logistic regression, discriminant analysis, support vector machines [SVM], Bayesian additive regression trees [BART], neural networks), although well developed and accessible for certain types of fraud detection (such as credit card fraud), may not be applicable to insurance fraud detection. Unsupervised classification methods not requiring knowledge of fraud labels for a subset of data are suitable for these insurance claim fraud detection applications.

1.2 Overview of Methodology and Solution Properties

In much of the previous fraud literature involving polychotomous or continuous variables, the analysts group or “bin” the data to create discrete (usually binary) variables. This is done to ensure that the scaling is comparable across the variables. Binning, however, results in information loss. Once the variable has been binned, one next needs to choose a method of assigning numerical values to represent the categories for subsequent numerical statistical analysis. Commonly one assigns increasing integers to the categories (raw integer scoring). A flaw with this, however, is that a raw-integer scored variable with four possible values (1, 2, 3, 4) might have an extreme value listed as a “4,” whereas another binary variable with values (1, 2) might have an extreme value listed as a “2.” Making the scale so that the extreme values are compatible has led most authors to group or bin the data so every variable has an equal number of categorical possibilities. Because many fraud detection variables are already binary (e.g., “Was a police report filed at the scene?”), in previous analysis the data were usually binned into binary variables.
For the model developed in this paper, we introduce a method for assigning a numerical score for the predictor variables by extending a technique developed in the epidemiological literature called RIDIT analysis (RIDIT stands for Relative to an Identified Distribution Integral Transformation). This scoring method was invented by Bross (1958) for the analysis of discrete ordinal data. An overview of RIDIT analysis is given in Flora (1988). For an application to fraud detection it is important to extend Bross’s RIDIT scores to include continuous predictors as well. Many fraud predictor variables such as “time between the accident and the filing of the claim,” “time between the accident and the filing of the police report,” “age of the claimant,” and “ambulance charges” are continuous predictor variables (Viaene et al. 2002). Our method of scoring the discrete and continuous variables overcomes both the information loss associated with binning and the scaling issue (because all variables are put on a common [−1, 1] scale). Brockett and Golden (1992) provide further intuitive justification for the RIDIT extension used herein and show that it also has better statistical properties than other competitor scaling methods, including raw integer scoring and conditional mean scoring, even for use involving other standard statistical methods such as regression and logistic regression. The unified method in this paper is an extension of the technique applied in Brockett et al. (2002) to binary predictors. The extension here allows a consistent analysis incorporating more variables and more information than before.

NORTH AMERICAN ACTUARIAL JOURNAL, VOLUME 13, NUMBER 4

In the model of this paper, each claim file is viewed conceptually as having a relative position on a latent “fraud suspicion” dimension that underlies the dichotomous “fraud” label, and the concatenation of the ensemble of predictor variables in the claim file provides information concerning this position. The fraud predictor variables in a claim file are constructed by domain experts to be related to fraud suspicion level in a monotonic fashion, and we assume that this set of variables has been chosen for the analysis (Weisberg and Derrig 1991). The initial dichotomy of fraud classification is used in the claims settlement process for deciding whether to spend more resources investigating and negotiating the claim or to pay it immediately. The developed method provides three useful outputs: (1) an ordinal measure of overall claim file fraud suspicion level capable of being ranked across claims and of being used in other statistical analysis, (2) a measure of the importance of each individual predictor variable in assessing the suspicion level of claims, and (3) a classification function that can be applied to existing or new incoming claims. Other unsupervised methods (such as cluster analysis or Kohonen’s self-organizing feature map) are usually not able to perform one or more of the above functions, making them more difficult to interpret and less useful for fraud detection. After we present the unified method, we illustrate its application to insurance fraud detection. We provide external validation of its performance by assessing its classification accuracy on a data set that contains “fraud” labels. To further diagnose whether the fraud detection failures we uncover are due to deficiencies in our new method or deficiencies in the data set, we compare the classifications produced by this (unsupervised) method to those generated by a set of supervised learning methods that are optimized for performance on the data set.
It is worth noting that these supervised methods are not really competitors for our method because they differ fundamentally in the prior information required for learning a model and in the implementation cost incurred in practice. Rather, the set of supervised learning methods (logistic regression, SVM, and BART) is used as a form of external validation of our method, just as the assessment of our method relative to the given “fraud” labels is best viewed as an external validation. Few unsupervised methods are currently available for insurance fraud detection. Among the few, Brockett, Xia, and Derrig (1998) use the unsupervised Kohonen’s Feature Map (Kohonen 1989) for insurance fraud classification. In this paper we examine and discuss the relative strengths of our method against two competitor unsupervised techniques: Kohonen’s self-organizing feature maps and cluster analysis.

The paper is organized as follows. The next section discusses our modification of the RIDIT scoring system of Bross (1958). Section 3 develops the unified unsupervised fraud classification method incorporating all types of predictor variables. More specifically, we develop an individual variable scoring method based on RIDITs, develop a measure of individual predictor variable importance, and create an overall claim file suspicion score using an iterative weight additive scoring method that shows its relation to principal component analysis. Section 4 demonstrates this method using a personal injury protection insurance claims data set from the Automobile Insurance Bureau in Massachusetts, where “fraud” labels are available, and assesses performance by comparison with supervised and unsupervised methods.
Because this insurance fraud data set contains only binary predictor variables, in Section 5 we introduce a second data set (an income classification data set) containing both discrete and continuous predictors to demonstrate why the extension to continuous predictors is desirable, that is, the information loss and performance deterioration due to data binning. Section 6 concludes the paper.

2. ADAPTING BROSS’S RIDIT SCORING METHOD TO SCORE FRAUD PREDICTOR VARIABLE RESPONSES WITH A VIEW TOWARD ASSESSING THE PREDICTIVE VARIABLES’ RELATIVE IMPORTANCE IN DISCERNING BETWEEN TWO CLASSES

In the context of epidemiology, Bross (1958) presented a scoring mechanism for subjectively ordinal (rank-ordered) categorical data that are not necessarily metric (numerical) in nature (e.g., “degree of paleness”). He called the scoring method (and the resulting analysis) RIDIT as an acronym for Relative to an Identified Distribution Integral Transformation to highlight that the scoring was relative to an identified probability distribution (usually some standard or control population distribution) and to signify that it was in the spirit of the usual probability integral transformation familiar to rank-ordered nonparametric statistics. In Bross’s RIDIT scoring, the score (numerical value) assigned to a particular response category $i$ of a categorical variable is defined as $\sum_{j<i} P_j + \tfrac{1}{2} P_i$, where $\{P_i\}$ is the probability of response $i$ based on an identified reference distribution for the responses (Bross 1958). It can be observed that these scores represent the expected relative rank of the response category $i$ with respect to the identified or standard distribution, with tied values counted as 1/2, divided by the sample size.
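As a small illustration of the definition above, the following Python sketch computes Bross's RIDIT score for each ordered category from a reference distribution. The function name `bross_ridit` and the three-category reference probabilities are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def bross_ridit(reference_probs):
    """Return Bross's RIDIT score R_i for each ordered category.

    reference_probs holds P_1, ..., P_K, the probabilities of the ordered
    response categories under the identified reference distribution.
    R_i = sum_{j<i} P_j + 0.5 * P_i.
    """
    if abs(sum(reference_probs) - 1.0) > 1e-9:
        raise ValueError("reference probabilities must sum to 1")
    scores = []
    below = 0.0  # cumulative probability of categories ranked below i
    for p in reference_probs:
        scores.append(below + 0.5 * p)
        below += p
    return scores


# Hypothetical reference distribution over three ordered categories.
reference = [0.2, 0.5, 0.3]
ridits = bross_ridit(reference)

# Under the reference distribution itself the mean RIDIT is 1/2.
mean_ridit = sum(p * r for p, r in zip(reference, ridits))
```

For this reference distribution the scores are approximately 0.1, 0.45, and 0.85, and their reference-weighted mean is 1/2, as expected of a probability integral transformation.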
Brockett and Levine (1977) and Brockett (1981) provide axiomatic foundations for RIDIT scoring, and Golden and Brockett (1987) show empirically that RIDIT scoring of rank-ordered categorical questionnaire data exhibits superior empirical performance relative to other standard scoring methods when used in subsequent statistical analysis. The scoring method used in our analysis generalizes this methodology to continuous as well as discrete scenarios. The variable score we use here is an adaptation (a linear transformation) of the original RIDIT score for discrete ordinal variables, which takes the empirical distribution for the sample as the identified comparison distribution (so the relative ranking in Bross’s context is with respect to the entire data set). Accordingly, in the discrete case we define the variable score $B_{it}$ for a claim file that has a response in category $i$ of the predictor variable $t$ to be

$$B_{it} = \sum_{j<i} P_{jt} - \sum_{j>i} P_{jt}, \qquad (2.1)$$

where $P_{jt}$ is the observed proportion of responses in category $j$ of variable $t$ in the sample. Accordingly, the score $B_{it}$ represents how “extreme” the response $i$ to variable $t$ is relative to the entire set of data. Suppressing the subscript $t$, if $R_i$ is Bross’s RIDIT score for category $i$ and the identified distribution used is the sample empirical distribution, the simple relation $B_i = 2R_i - 1$ holds. These variable scores (2.1) possess desirable properties. First, $B_{it}$ is bounded in $[-1, 1]$. Second, $B_{it}$ is monotonically increasing in the rank of response $i$. Third, $B_{it}$ is centered around zero, that is, it has mean equal to zero over the entire sample (Brockett et al. 2002).
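The discrete score in (2.1) and its three properties can be checked numerically. The sketch below is a minimal illustration, assuming that responses are coded so that sorting the category labels gives their ordinal rank; the function name `ridit_variable_scores` and the ten-claim sample are hypothetical.

```python
from collections import Counter


def ridit_variable_scores(responses):
    """Map each observed category i to
    B_i = (proportion ranked below i) - (proportion ranked above i),
    using the sample itself as the identified distribution."""
    n = len(responses)
    counts = Counter(responses)
    scores = {}
    below = 0  # number of responses in categories ranked below the current one
    for category in sorted(counts):
        above = n - below - counts[category]
        scores[category] = (below - above) / n
        below += counts[category]
    return scores


# Hypothetical sample of one four-category ordinal predictor for ten claims.
sample = [1, 1, 2, 2, 2, 3, 4, 4, 4, 4]
scores = ridit_variable_scores(sample)

# Sample mean of the scores, which (2.1) implies is zero.
mean_score = sum(scores[r] for r in sample) / len(sample)

# Bross's RIDIT for category 2 relative to the same sample: R = 0.2 + 0.5 * 0.3.
ridit_category_2 = 0.2 + 0.5 * 0.3
```

Here the category scores come out at about −0.8, −0.3, 0.1, and 0.6: bounded in [−1, 1], increasing in rank, and averaging to zero over the sample. The score for category 2 also agrees with 2R − 1 for its RIDIT.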
In the unsupervised learning context of this paper’s classification algorithm, it is assumed that the predictor variables have been constructed in such a fashion that the rank of response categories bears a monotonic relationship with the likelihood of being in the fraud class (i.e., without loss of generality, assume responses in lower-ranked response categories of a variable suggest higher suspicion of fraud). This is an intuitive assumption based on the domain knowledge. In essence, each predictor variable gives a “hint” as to the likelihood of the entity belonging to the fraud class, and the problem in this unsupervised learning context is how to put all these hints together to make a classification judgment and to derive related measures. Equation (2.1) is for ordinal discrete predictor variables; however, in many applications both continuous and discrete predictor variables are desired to be incorporated in a unified unsupervised classification method.

To derive the analogue of $B_{it}$ in the continuous case, note that from (2.1) the variable score $B_i$ is the proportion of claim files within response categories ranked lower than $i$ minus the proportion within higher ranked response categories (henceforth we suppress the index $t$ when we focus on the calculation for a single variable $t$). In the continuous predictor case, if $X$ is a continuous predictor having a monotonically decreasing relationship with the unobserved fraud suspicion level (i.e., a lower value of this variable leads to higher likelihood of fraud class membership), by analogy with (2.1), we define the variable score $B(x)$ as the proportion of claim files with response value less than $x$ minus the proportion of claim files with response value larger than $x$. Thus, we have

$$B(x) = \hat{F}(x^-) - [1 - \hat{F}(x)] = [\hat{F}(x) - \hat{P}(x)] - [1 - \hat{F}(x)] = 2\hat{F}(x) - 1 - \hat{P}(x), \qquad (2.2)$$

where $\hat{F}(x)$ is the empirical distribution of $X$ and $\hat{P}(x)$ is the sample proportion of response $x$.

The variable score $B(x)$ in (2.2) preserves the desirable characteristics from (2.1), that is, $B(x)$ is bounded in $[-1, 1]$ and $B(x)$ is monotonically increasing and centered around zero:

$$\begin{aligned}
E_{\hat{p}}[B(x)] &= \sum_{k=1}^{K} B(x_k)\hat{P}(x_k) = \sum_{k=1}^{K} [2\hat{F}(x_k) - 1 - \hat{P}(x_k)]\hat{P}(x_k) \\
&= 2\sum_{k=1}^{K}\sum_{l=1}^{k} \hat{P}(x_l)\hat{P}(x_k) - 1 - \sum_{k=1}^{K} [\hat{P}(x_k)]^2 \\
&= 2\sum_{k=1}^{K} [\hat{P}(x_k)]^2 + 2\sum_{k=2}^{K}\sum_{l=1}^{k-1} \hat{P}(x_k)\hat{P}(x_l) - \sum_{k=1}^{K} [\hat{P}(x_k)]^2 - 1 \\
&= \left[\sum_{k=1}^{K} \hat{P}(x_k)\right]^2 - 1 = 0,
\end{aligned}$$

where variable $X$ takes on the (increasingly ranked) values $x_1, \ldots, x_K$ in the sample.

3. A UNIFIED UNSUPERVISED CLASSIFICATION METHOD FOR DISCRETE AND CONTINUOUS PREDICTOR VARIABLES

3.1 Stochastic Dominance Assumption and Variable Construction

As discussed in Section 1, an important assumption of our model is that the individual predictor variables are constructed so that the fraud class members tend to score lower on predictor variables than the nonfraud class members. In practice, this assumption is not difficult to satisfy, because it can be guaranteed by the choice of variables for the analysis (i.e., by including only those that satisfy this assumption, or by rephrasing the variable description to obtain the desired ordering). Once this has been done, the individual predictor variables exhibit a first-order stochastic dominance relationship between the fraud class and the nonfraud class. Stated formally, let $F_1(\cdot)$ be the distribution function of some variable $t$ for class 1 (fraud class) and $G_2(\cdot)$ be the distribution function for class 2 (nonfraud class). The distribution $G_2(\cdot)$ first-order stochastically dominates $F_1(\cdot)$ if and only if $G_2(x) \le F_1(x)$ for all $x$, or equivalently, $F_1(x) - G_2(x) \ge 0$ for all $x$ (Mas-Colell, Whinston, and Green 1995). This is equivalent to our variable construction assumption that smaller response values suggest higher potential of fraud. We formally establish this in Proposition 1.
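The continuous score in (2.2) and the dominance condition of Section 3.1 can both be sketched with empirical distribution functions. The code below is an illustrative sketch only, not the authors' implementation; the function names and the small data sets are hypothetical, and the dominance check compares empirical CDFs at the observed values only.

```python
from bisect import bisect_left, bisect_right


def continuous_variable_scores(values):
    """Return B(x) = 2*F_hat(x) - 1 - P_hat(x) for each observation,
    where F_hat is the empirical CDF and P_hat the sample proportion at x."""
    ordered = sorted(values)
    n = len(ordered)
    scores = []
    for x in values:
        strictly_below = bisect_left(ordered, x)   # count of values < x
        at_or_below = bisect_right(ordered, x)     # count of values <= x
        f_hat = at_or_below / n
        p_hat = (at_or_below - strictly_below) / n
        scores.append(2.0 * f_hat - 1.0 - p_hat)
    return scores


def nonfraud_dominates(fraud, nonfraud):
    """Check G2(x) <= F1(x) at every observed value, i.e. the nonfraud
    empirical CDF first-order stochastically dominates the fraud one."""
    f1 = sorted(fraud)
    g2 = sorted(nonfraud)
    grid = sorted(set(f1) | set(g2))
    return all(
        bisect_right(g2, x) / len(g2) <= bisect_right(f1, x) / len(f1)
        for x in grid
    )


# Hypothetical continuous predictor values with one tie.
observed = [3.0, 1.0, 2.0, 2.0, 5.0]
b_scores = continuous_variable_scores(observed)
b_mean = sum(b_scores) / len(b_scores)

# Hypothetical labeled samples in which fraud claims tend to score lower.
fraud_values = [1.0, 2.0, 2.0, 3.0]
nonfraud_values = [2.0, 3.0, 4.0, 5.0]
dominance_holds = nonfraud_dominates(fraud_values, nonfraud_values)
```

In this example the scores are about 0.4, −0.8, −0.2, −0.2, and 0.8, which average to zero, and the dominance check succeeds because the nonfraud values are shifted upward relative to the fraud values. With real data, class labels are not available in the unsupervised setting, so a check of this kind is only possible on a labeled validation set.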

Similar resources

Fraud Detection in Health Insurance Using Expert Re-referencing

Fraud is widespread and very costly to the healthcare insurance system. Fraud involves intentional deception or misrepresentation intended to result in an unauthorized benefit. It is shocking because the incidence of health insurance fraud keeps increasing every year. In order to detect and avoid the fraud, data mining techniques are applied. Frauds blow a hole in the insurance industry. Health...

Detecting Fraud in Health Insurance Data: Learning to Model Incomplete Benford's Law Distributions

Benford’s Law [1] specifies the probabilistic distribution of digits for many commonly occurring phenomena, ideally when we have complete data of the phenomena. We enhance this digital analysis technique with an unsupervised learning method to handle situations where data is incomplete. We apply this method to the detection of fraud and abuse in health insurance claims using real health insuran...

Fast Unsupervised Automobile Insurance Fraud Detection Based on Spectral Ranking of Anomalies

Collecting insurance fraud samples is costly and if performed manually is very time consuming. This issue suggests usage of unsupervised models. One of the accurate methods in this regards is Spectral Ranking of Anomalies (SRA) that is shown to work better than other methods for auto insurance fraud detection specifically. However, this approach is not scalable to large samples and is not appro...

FRAUD CLASSIFICATION USING PRINCIPAL COMPONENT ANALYSIS OF RIDITs

This article introduces to the statistical and insurance literature a mathematical technique for an a priori classification of objects when no training sample exists for which the exact correct group membership is known. The article also provides an example of the empirical application of the methodology to fraud detection for bodily injury claims in automobile insurance. With this technique, p...

Using Kohonen's Self-Organizing Feature Map to Uncover Automobile Bodily Injury Claims Fraud

Claims fraud is an increasingly vexing problem confronting the insurance industry. In this empirical study, we apply Kohonen's Self-Organizing Feature Map to classify automobile bodily injury (BI) claims by the degree of fraud suspicion. Feed forward neural networks and a back propagation algorithm are used to investigate the validity of the Feature Map approach. Comparative experiments illustr...

Journal title:

Volume   Issue 

Pages  -

Publication date 2010